feat[Go]: implement chunk REST APIs (list, add, update, switch) by hunnyboy1217 · Pull Request #15670 · infiniflow/ragflow

hunnyboy1217 · 2026-06-04T14:06:05Z

What problem does this PR solve?

Closes #15668 — ports four missing chunk REST endpoints from Python (api/apps/restful_apis/chunk_api.py) to the Go API layer.

The following were already in Go: GET .../chunks/:chunk_id and DELETE .../chunks. This PR adds the remaining four:

Method	Path	Handler	Python source
GET	`/api/v1/datasets/:dataset_id/documents/:document_id/chunks`	`ListChunksREST`	`list_chunks`
POST	`/api/v1/datasets/:dataset_id/documents/:document_id/chunks`	`AddChunk`	`add_chunk`
PATCH	`/api/v1/datasets/:dataset_id/documents/:document_id/chunks/:chunk_id`	`UpdateChunkREST`	`update_chunk`
PATCH	`/api/v1/datasets/:dataset_id/documents/:document_id/chunks`	`SwitchChunks`	`switch_chunks`

Type of change

New Feature (non-breaking change which adds functionality)

Implementation notes

Files changed:

internal/service/chunk.go  – helpers + four service methods
internal/handler/chunk.go  – four HTTP handlers
internal/router/router.go  – route registration

Functional parity table:

Concern	Go behaviour
Dataset access	`resolveDatasetAccess` iterates user tenants, calls `kbDAO.GetByIDAndTenantID` — mirrors `KnowledgebaseService.accessible`
Document ownership	`documentDAO.GetByID` + `doc.KbID == datasetID` — mirrors `DocumentService.query(id=…, kb_id=…)`
Chunk ID generation	`xxhash.Sum64String(content + documentID)` — same as Python `xxhash.xxh64(…).hexdigest()`
Tokenisation	`tokenizer.Tokenize` + `tokenizer.FineGrainedTokenize` (existing Go tokenizer)
Embedding	`getEmbeddingModelForKB` resolves KB → embedding model (same lookup chain as `RetrievalTest`); `0.1 * docNameVec + 0.9 * contentVec` weighted average
Document stats	`gorm.Expr("chunk_num + 1")` / `gorm.Expr("token_num + ?", count)` via `documentDAO.UpdateByID` — mirrors `DocumentService.increment_chunk_num`
UpdateChunkREST	Re-embeds only when `content` or `questions` changes; all other fields (available, positions, tag_kwd, tag_feas) updated without re-embedding
SwitchChunks	Bulk `UpdateChunks` with `available_int` per chunk — mirrors Python's `docStoreConn.update` loop
Response shape	`AddChunk` returns `{"chunk": {…}}` with Python `key_mapping` field renames; `ListChunksREST` mirrors `_strip_chunk_runtime_fields` output

coderabbitai · 2026-06-04T14:06:38Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

Adds dataset/document-scoped chunk REST endpoints (list with pagination/filters, add with deterministic IDs + embeddings, partial update with conditional re-embed, and bulk availability toggle), router wiring, access/embedding helpers, and response normalization.

Changes

Chunk REST API Implementation

Layer / File(s)	Summary
Handler endpoints and imports `internal/handler/chunk.go`	Adds `strconv` import and four Gin handlers: `ListChunksREST`, `AddChunk`, `UpdateChunkREST`, `SwitchChunks` that parse path/query/body and forward requests to `ChunkService`.
Router wiring `internal/router/router.go`	Registers collection and item routes under `/api/v1/datasets/:dataset_id/documents/:document_id/chunks` for list, add, switch, get, patch, and delete.
Service imports & helpers `internal/service/chunk.go`	Adds `time`, `xxhash/v2`, `gorm` imports and implements `resolveDatasetAccess`, `getEmbeddingModelForKB`, `embedTexts`, `estimateTokenCount`, and `weightedVec`.
ListChunks (service + handler) `internal/service/chunk.go`, `internal/handler/chunk.go`	`ListChunksREST` builds a doc-engine search scoped to dataset/document with pagination, optional `keywords` and `available` filter; includes `orSlice`, `orStr`, `intToBool` normalizers and handler-side `page`/`page_size` parsing/capping.
AddChunk (service + handler) `internal/service/chunk.go`, `internal/handler/chunk.go`	`AddChunk` validates ownership/content, computes deterministic `chunkID` via `xxhash(content+documentID)`, tokenizes/normalizes questions/keywords, generates embeddings from `[doc.Name, embedInput]`, stores a weighted vector key, inserts chunk into doc engine, and updates parent document counters; handler binds `AddChunkRequest`.
UpdateChunk (service + handler) `internal/service/chunk.go`, `internal/handler/chunk.go`	`UpdateChunkREST` validates ownership and chunk scope, applies partial updates, re-embeds only when `content` or `questions` change, and persists updates; handler binds `UpdateChunkRESTRequest`.
SwitchChunks (service + handler) `internal/service/chunk.go`, `internal/handler/chunk.go`	`SwitchChunks` validates `chunk_ids`, resolves ownership, computes `available_int`, iterates updating each chunk's availability while logging per-item failures; handler validates payload and returns result.

Sequence Diagram — AddChunk flow

sequenceDiagram
  participant Client
  participant ChunkHandler
  participant ChunkService
  participant DocEngine
  participant EmbeddingService
  Client->>ChunkHandler: POST /datasets/:ds/documents/:doc/chunks {AddChunkRequest}
  ChunkHandler->>ChunkService: AddChunk(datasetID, documentID, userID, req)
  ChunkService->>DocEngine: Verify document belongs to KB / insert chunk
  ChunkService->>EmbeddingService: embedTexts([doc.Name, embedInput])
  EmbeddingService-->>ChunkService: vector
  ChunkService->>DocEngine: store chunk + q_<len>_vec
  ChunkService-->>ChunkHandler: {"chunk": {...}}
  ChunkHandler-->>Client: HTTP 200 {code,data,message}

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

infiniflow/ragflow#15749: Related work reading/wrapping q_<dim>_vec chunk vectors used by this PR.
infiniflow/ragflow#15055: Router/GET chunk endpoint changes that align with this PR’s new chunk REST wiring.
infiniflow/ragflow#14208: Example scripts exercising dataset/document chunk management APIs added here.

Suggested labels

💞 feature

Suggested reviewers

Haruko386
Hz-186
JinHai-CN

"A Rabbit's Note on Chunks"
I hopped through code and seeded names,
Deterministic hashes, tiny claims.
Embeddings hum and counters climb,
Chunks now line up, neat and fine.
✨🐇

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 62.50% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The PR title accurately and concisely describes the main change: implementation of four chunk REST API endpoints (list, add, update, switch) in Go.
Description check	✅ Passed	The PR description comprehensively addresses the template by explaining the problem solved (porting four Python endpoints to Go), specifying the type of change (New Feature), documenting all four endpoints with paths and handlers, and providing detailed implementation notes including functional parity with Python behavior.
Linked Issues check	✅ Passed	The PR implements four of the seven chunk endpoints required by `#15668`: ListChunksREST, AddChunk, UpdateChunkREST, and SwitchChunks. The implementation achieves behavioral parity with Python sources via dataset/document access validation, chunk ID generation, tokenization, embedding, and document stat updates.
Out of Scope Changes check	✅ Passed	All changes are scoped to implementing the four chunk REST endpoints required by `#15668` (list, add, update, switch). The modifications to internal/handler/chunk.go, internal/service/chunk.go, and internal/router/router.go are directly aligned with the PR objectives and do not introduce unrelated changes.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

♻️ Duplicate comments (1)

internal/service/chunk.go (1)

1360-1361: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Same panic risk with tokenizer.Tokenize result cast.

This has the same issue as in AddChunk - direct type assertion without safety check.

🛡️ Proposed fix

 	// Tokenize.
 	contentLtks, _ := tokenizer.Tokenize(content)
-	contentSmLtks, _ := tokenizer.FineGrainedTokenize(contentLtks.(string))
+	contentLtksStr, ok := contentLtks.(string)
+	if !ok {
+		contentLtksStr = ""
+	}
+	contentSmLtks, _ := tokenizer.FineGrainedTokenize(contentLtksStr)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/service/chunk.go` around lines 1360 - 1361, The code unsafely
type-asserts tokenizer.Tokenize(...) result to string (contentLtks.(string))
which can panic; change the call sites (where contentLtks and contentSmLtks are
set) to use the two-value type assertion or a type switch to verify the returned
value is a string before using it, and propagate or handle errors from
tokenizer.Tokenize and tokenizer.FineGrainedTokenize instead of ignoring them;
update the logic around variables contentLtks and contentSmLtks and the calls to
tokenizer.Tokenize and tokenizer.FineGrainedTokenize so you return/log an error
or bail out if the result type is unexpected.

🧹 Nitpick comments (1)

internal/handler/chunk.go (1)

367-381: 💤 Low value

Consider using strconv.Atoi instead of json.Number for query parameter parsing.

Using json.Number(v).Int64() for string-to-int conversion is unconventional. strconv.Atoi or strconv.ParseInt is more idiomatic for parsing query parameters and avoids the intermediate json.Number type.

♻️ Suggested refactor

 	page := 1
 	if v := c.Query("page"); v != "" {
-		if p, err := json.Number(v).Int64(); err == nil && p > 0 {
+		if p, err := strconv.Atoi(v); err == nil && p > 0 {
 			page = int(p)
 		}
 	}
 	pageSize := 30
 	if v := c.Query("page_size"); v != "" {
-		if ps, err := json.Number(v).Int64(); err == nil && ps > 0 {
+		if ps, err := strconv.Atoi(v); err == nil && ps > 0 {
 			if ps > 100 {
 				ps = 100
 			}
-			pageSize = int(ps)
+			pageSize = ps
 		}
 	}

This would also require adding "strconv" to the imports.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/handler/chunk.go` around lines 367 - 381, The current parsing of
query params for pagination (page and pageSize variables in
internal/handler/chunk.go) uses json.Number(v).Int64(), which is unconventional;
replace those conversions with strconv.Atoi (or strconv.ParseInt) to parse
c.Query("page") and c.Query("page_size"), keep the same validation (positive
values and cap pageSize to 100), and add "strconv" to imports; ensure you still
default page=1 and pageSize=30 when parsing fails or values are empty.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/service/chunk.go`:
- Around line 1431-1459: SwitchChunks currently updates chunks by id without
verifying they belong to documentID; modify SwitchChunks to include the document
ownership check: when building the update condition passed to
s.docEngine.UpdateChunks include "doc_id": documentID (in addition to "id":
chunkID) or first fetch the chunk via s.docEngine (e.g., GetChunk/GetChunks) and
verify chunk.DocID equals documentID before calling UpdateChunks; ensure you
still respect tenantID/indexName/datasetID and keep available_int logic and
error logging (use the same condition/update variable names to locate the
change).

---

Duplicate comments:
In `@internal/service/chunk.go`:
- Around line 1360-1361: The code unsafely type-asserts tokenizer.Tokenize(...)
result to string (contentLtks.(string)) which can panic; change the call sites
(where contentLtks and contentSmLtks are set) to use the two-value type
assertion or a type switch to verify the returned value is a string before using
it, and propagate or handle errors from tokenizer.Tokenize and
tokenizer.FineGrainedTokenize instead of ignoring them; update the logic around
variables contentLtks and contentSmLtks and the calls to tokenizer.Tokenize and
tokenizer.FineGrainedTokenize so you return/log an error or bail out if the
result type is unexpected.

---

Nitpick comments:
In `@internal/handler/chunk.go`:
- Around line 367-381: The current parsing of query params for pagination (page
and pageSize variables in internal/handler/chunk.go) uses
json.Number(v).Int64(), which is unconventional; replace those conversions with
strconv.Atoi (or strconv.ParseInt) to parse c.Query("page") and
c.Query("page_size"), keep the same validation (positive values and cap pageSize
to 100), and add "strconv" to imports; ensure you still default page=1 and
pageSize=30 when parsing fails or values are empty.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 85724050-8e76-4982-be04-53bbaa180426

📥 Commits

Reviewing files that changed from the base of the PR and between 98f2a2e and 722d990.

📒 Files selected for processing (3)

internal/handler/chunk.go
internal/router/router.go
internal/service/chunk.go

hunnyboy1217 · 2026-06-05T05:54:38Z

Hi, @Haruko386 ,
Please review this PR.

Haruko386 · 2026-06-08T05:45:11Z

Hi @hunnyboy1217 I will review your code soon

Haruko386

@hunnyboy1217 plz fix CI error first

Hz-186 · 2026-06-08T08:32:38Z

@hunnyboy1217
and fix the conflicts plz

coderabbitai

Actionable comments posted: 4

♻️ Duplicate comments (1)

internal/service/chunk.go (1)

1458-1474: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Scope the bulk update to the requested document.

documentID is never used after access resolution, and each update filters only by id. This endpoint is still not document-scoped, so supplied chunk IDs from other documents in the same dataset can be toggled here.

🩹 Proposed fix

 	for _, chunkID := range chunkIDs {
-		condition := map[string]interface{}{"id": chunkID}
+		condition := map[string]interface{}{"id": chunkID, "doc_id": documentID}
 		update := map[string]interface{}{"available_int": availInt}
 		if err := s.docEngine.UpdateChunks(ctx, condition, update, indexName, datasetID); err != nil {
 			common.Warn("SwitchChunks: failed to update chunk", zap.String("chunkID", chunkID), zap.Error(err))
 		}
 	}

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/service/chunk.go` around lines 1458 - 1474, The update loop
currently uses condition := {"id": chunkID} so it ignores the resolved document
scope (documentID) and can flip chunks from other documents; change the
condition used in s.docEngine.UpdateChunks to include the document identifier
(e.g., add "document_id": documentID or the appropriate field name) so the bulk
update in the loop (function using resolveDatasetAccess, variables chunkIDs,
available, indexName) is scoped to the requested document, leaving indexName and
datasetID usage unchanged.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/service/chunk.go`:
- Around line 1394-1438: When re-embedding (in the block using
getEmbeddingModelForKB, embedTexts, weightedVec) preserve existing persisted
questions if req.Questions is nil: read d["question_kwd"] (or the stored
question list) into the local questions variable when req.Questions == nil so
embedInput uses the persisted questions rather than empty content; if
req.Questions is provided keep current behavior and update d["question_kwd"] as
you already do. Ensure questions is populated before building embedInput and
before computing the weighted vector so vectors stay consistent with persisted
question_kwd.
- Around line 1285-1289: The code currently swallows errors from
s.documentDAO.UpdateByID after a successful chunk insert; capture and handle the
UpdateByID error instead of ignoring it — either wrap the chunk creation and the
UpdateByID call in a single DB transaction and rollback on any error, or check
the returned error from s.documentDAO.UpdateByID and return it (or propagate a
wrapped error) so the API reports failure; ensure you use the existing chunk
insert call and s.documentDAO.UpdateByID symbols and do not leave the counter
update silent on failure.
- Around line 1124-1136: The REST payload builds the result map using write-side
field names (e.g. "content_with_weight", "important_kwd", "question_kwd",
"docnm_kwd") which causes normalized search-hit data to be dropped; update the
map construction around the result variable so it reads the normalized
search-hit keys used elsewhere in this file's List/Get paths (e.g. "content",
"important_keywords", "questions", "docnm") while still using the existing
helpers (orSlice, orStr, intToBool) and datasetID/image_id handling; locate the
result map creation in chunk.go and replace the write-side keys with the
normalized names so responses include existing chunk values.
- Around line 1000-1022: getEmbeddingModelForKB currently errors when no
KB-specific embedding is set; instead reuse the existing resolveEmbeddingModel
fallback: when embdID is empty after the KB lookup in getEmbeddingModelForKB,
call resolveEmbeddingModel(tenantID) (or the service method that implements that
tenant-default resolution) to obtain the tenant default embdID/model, handle its
error, and then call modelProviderSvc.GetEmbeddingModel(tenantID, embdID).
Update error paths to preserve original error context from resolveEmbeddingModel
and remove the unconditional "no embedding model configured" error in favor of
the resolved result.

---

Duplicate comments:
In `@internal/service/chunk.go`:
- Around line 1458-1474: The update loop currently uses condition := {"id":
chunkID} so it ignores the resolved document scope (documentID) and can flip
chunks from other documents; change the condition used in
s.docEngine.UpdateChunks to include the document identifier (e.g., add
"document_id": documentID or the appropriate field name) so the bulk update in
the loop (function using resolveDatasetAccess, variables chunkIDs, available,
indexName) is scoped to the requested document, leaving indexName and datasetID
usage unchanged.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ece3b9fa-5f66-4d97-99e7-9c47e62197b9

📥 Commits

Reviewing files that changed from the base of the PR and between 64e7070 and 660805c.

📒 Files selected for processing (3)

internal/handler/chunk.go
internal/router/router.go
internal/service/chunk.go

🚧 Files skipped from review as they are similar to previous changes (1)

internal/handler/chunk.go

coderabbitai · 2026-06-08T10:24:15Z

+func (s *ChunkService) getEmbeddingModelForKB(kb *entity.Knowledgebase, tenantID string) (*models.EmbeddingModel, error) {
+	tenantLLMDAO := dao.NewTenantLLMDAO()
+	modelProviderSvc := NewModelProviderService()
+
+	var embdID string
+	var err error
+	if kb.TenantEmbdID != nil && *kb.TenantEmbdID > 0 {
+		_, embdID, err = dao.LookupTenantLLMByID(tenantLLMDAO, *kb.TenantEmbdID)
+	} else if kb.EmbdID != "" {
+		parts := strings.Split(kb.EmbdID, "@")
+		if len(parts) == 2 && parts[1] != "" {
+			_, embdID, err = dao.LookupTenantLLMByFactory(tenantLLMDAO, tenantID, parts[1], parts[0], entity.ModelTypeEmbedding)
+		} else {
+			_, embdID, err = dao.LookupTenantLLMByName(tenantLLMDAO, tenantID, kb.EmbdID, entity.ModelTypeEmbedding)
+		}
+	}
+	if err != nil {
+		return nil, fmt.Errorf("failed to resolve embedding model: %w", err)
+	}
+	if embdID == "" {
+		return nil, fmt.Errorf("no embedding model configured for dataset")
+	}
+	return modelProviderSvc.GetEmbeddingModel(tenantID, embdID)


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Reuse the existing embedding-model fallback here.

resolveEmbeddingModel already falls back to the tenant default when a KB has no explicit embedding model, but getEmbeddingModelForKB returns an error instead. That makes AddChunk and UpdateChunkREST fail for datasets that still rely on the tenant default embedding configuration.

🩹 Proposed fix

func (s *ChunkService) getEmbeddingModelForKB(kb *entity.Knowledgebase, tenantID string) (*models.EmbeddingModel, error) { - tenantLLMDAO := dao.NewTenantLLMDAO() - modelProviderSvc := NewModelProviderService() - - var embdID string - var err error - if kb.TenantEmbdID != nil && *kb.TenantEmbdID > 0 { - _, embdID, err = dao.LookupTenantLLMByID(tenantLLMDAO, *kb.TenantEmbdID) - } else if kb.EmbdID != "" { - parts := strings.Split(kb.EmbdID, "@") - if len(parts) == 2 && parts[1] != "" { - _, embdID, err = dao.LookupTenantLLMByFactory(tenantLLMDAO, tenantID, parts[1], parts[0], entity.ModelTypeEmbedding) - } else { - _, embdID, err = dao.LookupTenantLLMByName(tenantLLMDAO, tenantID, kb.EmbdID, entity.ModelTypeEmbedding) - } - } - if err != nil { - return nil, fmt.Errorf("failed to resolve embedding model: %w", err) - } - if embdID == "" { - return nil, fmt.Errorf("no embedding model configured for dataset") - } - return modelProviderSvc.GetEmbeddingModel(tenantID, embdID) + return s.resolveEmbeddingModel(tenantID, kb) }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

func (s *ChunkService) getEmbeddingModelForKB(kb *entity.Knowledgebase, tenantID string) (*models.EmbeddingModel, error) {

tenantLLMDAO := dao.NewTenantLLMDAO()

modelProviderSvc := NewModelProviderService()

var embdID string

var err error

if kb.TenantEmbdID != nil && *kb.TenantEmbdID > 0 {

_, embdID, err = dao.LookupTenantLLMByID(tenantLLMDAO, *kb.TenantEmbdID)

} else if kb.EmbdID != "" {

parts := strings.Split(kb.EmbdID, "@")

if len(parts) == 2 && parts[1] != "" {

_, embdID, err = dao.LookupTenantLLMByFactory(tenantLLMDAO, tenantID, parts[1], parts[0], entity.ModelTypeEmbedding)

} else {

_, embdID, err = dao.LookupTenantLLMByName(tenantLLMDAO, tenantID, kb.EmbdID, entity.ModelTypeEmbedding)

}

}

if err != nil {

return nil, fmt.Errorf("failed to resolve embedding model: %w", err)

}

if embdID == "" {

return nil, fmt.Errorf("no embedding model configured for dataset")

}

return modelProviderSvc.GetEmbeddingModel(tenantID, embdID)

func (s *ChunkService) getEmbeddingModelForKB(kb *entity.Knowledgebase, tenantID string) (*models.EmbeddingModel, error) {

return s.resolveEmbeddingModel(tenantID, kb)

}

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/service/chunk.go` around lines 1000 - 1022, getEmbeddingModelForKB currently errors when no KB-specific embedding is set; instead reuse the existing resolveEmbeddingModel fallback: when embdID is empty after the KB lookup in getEmbeddingModelForKB, call resolveEmbeddingModel(tenantID) (or the service method that implements that tenant-default resolution) to obtain the tenant default embdID/model, handle its error, and then call modelProviderSvc.GetEmbeddingModel(tenantID, embdID). Update error paths to preserve original error context from resolveEmbeddingModel and remove the unconditional "no embedding model configured" error in favor of the resolved result.

coderabbitai · 2026-06-08T10:24:15Z

+		result := map[string]interface{}{
+			"id":                 chunk["id"],
+			"content":            chunk["content_with_weight"],
+			"document_id":        chunk["doc_id"],
+			"docnm_kwd":          chunk["docnm_kwd"],
+			"important_keywords": orSlice(chunk["important_kwd"]),
+			"questions":          orSlice(chunk["question_kwd"]),
+			"tag_kwd":            orSlice(chunk["tag_kwd"]),
+			"dataset_id":         datasetID,
+			"image_id":           orStr(chunk["img_id"]),
+			"available":          intToBool(chunk["available_int"]),
+			"positions":          orSlice(chunk["position_int"]),
+		}


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Normalize search hits before building the REST payload.

This code reads write-side field names (content_with_weight, important_kwd, question_kwd, docnm_kwd), but the existing List/Get paths in this file normalize raw search-hit fields like content, important_keywords, questions, and docnm. As written, REST list responses will drop those values for existing chunks.

🩹 Proposed fix

result := map[string]interface{}{ "id": chunk["id"], - "content": chunk["content_with_weight"], + "content": utility.FirstNonNil(chunk["content_with_weight"], chunk["content"]), "document_id": chunk["doc_id"], - "docnm_kwd": chunk["docnm_kwd"], - "important_keywords": orSlice(chunk["important_kwd"]), - "questions": orSlice(chunk["question_kwd"]), + "docnm_kwd": utility.FirstNonNil(chunk["docnm_kwd"], chunk["docnm"]), + "important_keywords": orSlice(utility.FirstNonNil(chunk["important_kwd"], chunk["important_keywords"])), + "questions": orSlice(utility.FirstNonNil(chunk["question_kwd"], chunk["questions"])), "tag_kwd": orSlice(chunk["tag_kwd"]), "dataset_id": datasetID, "image_id": orStr(chunk["img_id"]), "available": intToBool(chunk["available_int"]), "positions": orSlice(chunk["position_int"]),

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/service/chunk.go` around lines 1124 - 1136, The REST payload builds the result map using write-side field names (e.g. "content_with_weight", "important_kwd", "question_kwd", "docnm_kwd") which causes normalized search-hit data to be dropped; update the map construction around the result variable so it reads the normalized search-hit keys used elsewhere in this file's List/Get paths (e.g. "content", "important_keywords", "questions", "docnm") while still using the existing helpers (orSlice, orStr, intToBool) and datasetID/image_id handling; locate the result map creation in chunk.go and replace the write-side keys with the normalized names so responses include existing chunk values.

coderabbitai · 2026-06-08T10:24:15Z

+	// Increment document chunk_num and token_num.
+	_ = s.documentDAO.UpdateByID(documentID, map[string]interface{}{
+		"chunk_num": gorm.Expr("chunk_num + 1"),
+		"token_num": gorm.Expr("token_num + ?", tokenCount),
+	})


⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Don't swallow the document-counter write failure.

If the chunk insert succeeds but UpdateByID fails, chunk_num and token_num drift from the actual chunk store while the API still returns success. This silently breaks document stats and makes retries/reconciliation ambiguous.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/service/chunk.go` around lines 1285 - 1289, The code currently swallows errors from s.documentDAO.UpdateByID after a successful chunk insert; capture and handle the UpdateByID error instead of ignoring it — either wrap the chunk creation and the UpdateByID call in a single DB transaction and rollback on any error, or check the returned error from s.documentDAO.UpdateByID and return it (or propagate a wrapped error) so the API reports failure; ensure you use the existing chunk insert call and s.documentDAO.UpdateByID symbols and do not leave the counter update silent on failure.

coderabbitai · 2026-06-08T10:24:15Z

+	questions := []string{}
+	if req.Questions != nil {
+		for _, q := range req.Questions {
+			if q = strings.TrimSpace(q); q != "" {
+				questions = append(questions, q)
+			}
+		}
+		d["question_kwd"] = questions
+		d["question_tks"] = strings.Join(questions, "\n")
+	}
+
+	if req.Available != nil {
+		avInt := 0
+		if *req.Available {
+			avInt = 1
+		}
+		d["available_int"] = avInt
+	}
+	if req.Positions != nil {
+		d["position_int"] = req.Positions
+	}
+	if req.TagKwd != nil {
+		d["tag_kwd"] = req.TagKwd
+	}
+	if req.TagFeas != nil {
+		d["tag_feas"] = req.TagFeas
+	}
+
+	// Re-embed when content or questions changed.
+	if req.Content != nil || req.Questions != nil {
+		em, err := s.getEmbeddingModelForKB(kb, tenantID)
+		if err != nil {
+			return fmt.Errorf("failed to get embedding model: %w", err)
+		}
+		embedInput := content
+		if len(questions) > 0 {
+			embedInput = strings.Join(questions, "\n")
+		}
+		vecs, _, err := embedTexts(em, []string{doc.Name, embedInput})
+		if err != nil {
+			return fmt.Errorf("embedding failed: %w", err)
+		}
+		if len(vecs) >= 2 {
+			vec := weightedVec(vecs[0], vecs[1])
+			d[fmt.Sprintf("q_%d_vec", len(vec))] = vec


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Preserve existing questions when only content changes.

questions starts empty and is only filled from req.Questions. If a chunk already has questions and the caller patches only content, the re-embed path falls back to raw content while question_kwd remains unchanged in storage. The vector then no longer matches the chunk's persisted question state.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/service/chunk.go` around lines 1394 - 1438, When re-embedding (in the block using getEmbeddingModelForKB, embedTexts, weightedVec) preserve existing persisted questions if req.Questions is nil: read d["question_kwd"] (or the stored question list) into the local questions variable when req.Questions == nil so embedInput uses the persisted questions rather than empty content; if req.Questions is provided keep current behavior and update d["question_kwd"] as you already do. Ensure questions is populated before building embedInput and before computing the weighted vector so vectors stay consistent with persisted question_kwd.

coderabbitai

♻️ Duplicate comments (1)

internal/service/chunk.go (1)
1480-1488: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

SwitchChunks does not validate that chunks belong to the specified document.

The documentID parameter is accepted but never used. The update condition at line 1482 contains only "id": chunkID, allowing a user with dataset access to toggle availability of chunks from any document in that dataset—not just the document specified in the URL path.

Add "doc_id": documentID to the condition map so the update is scoped to the correct document:
🔒 Proposed fix
 	for _, chunkID := range chunkIDs {
-		condition := map[string]interface{}{"id": chunkID}
+		condition := map[string]interface{}{"id": chunkID, "doc_id": documentID}
 		update := map[string]interface{}{"available_int": availInt}
 		if err := s.docEngine.UpdateChunks(ctx, condition, update, indexName, datasetID); err != nil {
 			common.Warn("SwitchChunks: failed to update chunk", zap.String("chunkID", chunkID), zap.Error(err))
 		}
 	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/service/chunk.go` around lines 1480 - 1488, SwitchChunks accepts
documentID but never scopes the chunk update to that document, so UpdateChunks
is called with condition {"id": chunkID} and can modify chunks from other
documents; update the condition map in the loop inside SwitchChunks to include
the document ID (e.g., add "doc_id": documentID) so the call to
s.docEngine.UpdateChunks(ctx, condition, update, indexName, datasetID) only
affects chunks belonging to the specified documentID (ensure the key name
matches the DB field used elsewhere).

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@internal/service/chunk.go`:
- Around line 1480-1488: SwitchChunks accepts documentID but never scopes the
chunk update to that document, so UpdateChunks is called with condition {"id":
chunkID} and can modify chunks from other documents; update the condition map in
the loop inside SwitchChunks to include the document ID (e.g., add "doc_id":
documentID) so the call to s.docEngine.UpdateChunks(ctx, condition, update,
indexName, datasetID) only affects chunks belonging to the specified documentID
(ensure the key name matches the DB field used elsewhere).

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 74ea85a6-36ef-419f-a4ec-ab3d3c1af790

📥 Commits

Reviewing files that changed from the base of the PR and between 660805c and 47e915f.

📒 Files selected for processing (3)

internal/handler/chunk.go
internal/router/router.go
internal/service/chunk.go

🚧 Files skipped from review as they are similar to previous changes (2)

internal/router/router.go
internal/handler/chunk.go

codecov · 2026-06-08T14:27:41Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 93.16%. Comparing base (93e4f6b) to head (c50554f).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files

@@           Coverage Diff           @@
##             main   #15670   +/-   ##
=======================================
  Coverage   93.16%   93.16%           
=======================================
  Files          10       10           
  Lines         717      717           
  Branches      118      118           
=======================================
  Hits          668      668           
  Misses         29       29           
  Partials       20       20

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

hunnyboy1217 · 2026-06-08T15:11:25Z

Hi, @Haruko386, @Hz-186,
All conflicts have been resolved, and CI tests have been successfully passed.

Hz-186

@hunnyboy1217

Thanks for the great effort in porting the chunk management APIs to Go!

I've conducted a thorough end-to-end API functional test between the existing Python API (:9380) and the new Go API (:9384).

I found 3 critical bugs that need to be fixed before merging:

1. Go API fails to resolve valid Embedding Models (AddChunk)
When calling POST /api/v1/datasets/.../chunks, the Python API successfully parses and embeds the chunk using the configured model (e.g. Qwen/Qwen3-Embedding-8B@test@SILICONFLOW).

However, the Go API fails with: failed to get embedding model... record not found. The model resolution logic in getEmbeddingModelForKB (or the underlying tenant LLM lookup) seems broken for complex model names/factories, blocking chunk creation entirely.

2. Silent Success on Non-existent Chunk IDs (SwitchChunks)
When sending PATCH /api/v1/datasets/.../chunks with a fake or invalid chunk_id (e.g., ["FAKE_ID"]):

Python API strictly returns code: 102 (Index updating failure).
Go API returns code: 0 (success).
The Go implementation is missing the validation to check if updated_count > 0 after hitting the document engine, giving users a false positive success.

3. Severe Schema Mismatch in ListChunksREST
The doc object returned in GET /api/v1/datasets/.../chunks diverges heavily in Go:

Wrong Keys: Go returns kb_id, chunk_num, and token_num. Python relies on dataset_id, chunk_count, and token_count.
Missing Metadata: The Go implementation strips out almost all critical document metadata (e.g., parser_config, process_duration, status, size, type). Python includes all of these. If left unfixed, this will cause the frontend console to crash due to missing fields.

Could you please address these logic alignments? Thanks again for your hard work!

Ports four chunk endpoints from Python to Go (infiniflow#15668): GET /api/v1/datasets/:dataset_id/documents/:document_id/chunks (ListChunksREST) POST /api/v1/datasets/:dataset_id/documents/:document_id/chunks (AddChunk) PATCH /api/v1/datasets/:dataset_id/documents/:document_id/chunks/:chunk_id (UpdateChunkREST) PATCH /api/v1/datasets/:dataset_id/documents/:document_id/chunks (SwitchChunks) Addresses review feedback (CodeRabbit) and the CI build break: - tokenizer.Tokenize returns (string, error); the bogus contentLtks.(string) assertion on a non-interface value (a compile error breaking CI) is removed — the string is passed straight to FineGrainedTokenize - ListChunksREST page/page_size now parse with strconv.Atoi instead of json.Number(...).Int64() - embedTexts no longer reports len(d.Embedding) (the fixed vector dimension) as token usage; token count is derived from the input text via the project tokenizer (estimateTokenCount), since the embedding driver exposes no provider token usage Rebased onto latest main, resolving the chunk.go import conflict with upstream's RetrievalTest refactor (now using service/nlp): merged the import union (keeping nlp from upstream and xxhash/gorm from this PR).

hunnyboy1217 · 2026-06-09T04:50:02Z

HI, @Hz-186 ,
Just updated, please review again. Thank you

dosubot Bot added the size:XL This PR changes 500-999 lines, ignoring generated files. label Jun 4, 2026

coderabbitai Bot reviewed Jun 4, 2026

View reviewed changes

Comment thread internal/service/chunk.go Outdated

Comment thread internal/service/chunk.go

hunnyboy1217 force-pushed the feat/go-chunk-rest-apis branch from 722d990 to 52752d2 Compare June 8, 2026 05:22

Haruko386 assigned Haruko386 and unassigned Haruko386 Jun 8, 2026

Haruko386 self-requested a review June 8, 2026 05:45

Haruko386 marked this pull request as draft June 8, 2026 07:12

Haruko386 added the ci Continue Integration label Jun 8, 2026

Haruko386 marked this pull request as ready for review June 8, 2026 07:13

dosubot Bot added the 💞 feature Feature request, pull request that fullfill a new feature. label Jun 8, 2026

Haruko386 removed the 💞 feature Feature request, pull request that fullfill a new feature. label Jun 8, 2026

Haruko386 reviewed Jun 8, 2026

View reviewed changes

hunnyboy1217 force-pushed the feat/go-chunk-rest-apis branch from 64e7070 to fef32ae Compare June 8, 2026 10:12

coderabbitai Bot reviewed Jun 8, 2026

View reviewed changes

hunnyboy1217 force-pushed the feat/go-chunk-rest-apis branch from 660805c to 47e915f Compare June 8, 2026 11:02

coderabbitai Bot reviewed Jun 8, 2026

View reviewed changes

hunnyboy1217 force-pushed the feat/go-chunk-rest-apis branch 2 times, most recently from c2d8273 to 95d8fa7 Compare June 8, 2026 14:13

hunnyboy1217 requested a review from Haruko386 June 8, 2026 22:05

Hz-186 suggested changes Jun 9, 2026

View reviewed changes

hunnyboy1217 force-pushed the feat/go-chunk-rest-apis branch from 95d8fa7 to c50554f Compare June 9, 2026 04:47

Conversation

hunnyboy1217 commented Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What problem does this PR solve?

Type of change

Implementation notes

Uh oh!

coderabbitai Bot commented Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Sequence Diagram — AddChunk flow

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

hunnyboy1217 commented Jun 5, 2026

Uh oh!

Haruko386 commented Jun 8, 2026

Uh oh!

Haruko386 left a comment

Choose a reason for hiding this comment

Uh oh!

Hz-186 commented Jun 8, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 8, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 8, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 8, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 8, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

codecov Bot commented Jun 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

hunnyboy1217 commented Jun 8, 2026

Uh oh!

Hz-186 left a comment

Choose a reason for hiding this comment

Uh oh!

hunnyboy1217 commented Jun 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

hunnyboy1217 commented Jun 4, 2026 •

edited

Loading

coderabbitai Bot commented Jun 4, 2026 •

edited

Loading

codecov Bot commented Jun 8, 2026 •

edited

Loading